Using Synthetic-Perturbation Techniques
for Tuning Shared Memory Programs


Robert  Snelick 
Joseph JaJa

Raghu  Kacker 
Gordon  Lyon 

National Institute of Standards and Technology
Gaithersburg,  Maryland  20899 

Abstract 

The  Synthetic-Perturbation  Tuning  (SPT)  methodology 
is  based  on  an  empirical  approach  that  introduces  artificial 
delays  into  the  MIMD  program  and  captures  the  effects  of 
such  delays  by  using  the  modern  branch  of  statistics  called 
design of experiments. SPT provides the basis of a powerful
tool for tuning MIMD programs that is portable across
machines  and  architectures.  The  purpose  of  this  paper  is  to 
explain  the  general  approach  and  to  extend  it  to  address  spe- 
cific features  that  are  the  main  source  of  poor  performance 
on  the  shared  memory  programming  model.  These  include 
performance  degradation  due  to  load  imbalance  and  insuf- 
ficient parallelism,  overhead  introduced  by  synchronizations 
and  by  accessing  shared  data  structures,  and  compute  time 
bottlenecks.  We  illustrate  the  practicality  of  SPT  by  demon- 
strating its  use  on  two  very  different  case  studies:  a large 
image  processing  benchmark  and  a parallel  quicksort. 


Key words: design of experiments, parallel programs, performance, shared
memory programming model, synthetic-perturbation, tuning

1 Introduction


Today’s  multiprocessors  provide  unprecedented  performance  potential,  yet 
all  too  often  the  actual  performance  obtained  is  far  less  impressive.  Since 
their  inception,  a deficiency  of  multiprocessor  computers  has  been  the  lack  of 
adequate  performance  measurement  and  debugging  tools.  The  inherent  com- 
plexity of  parallel  programs  makes  it  far  more  difficult  to  capture  true  perfor- 
mance measurements  on  multiple-instruction  stream,  multiple-data  stream 
(MIMD)  architectures.  In  the  absence  of  MIMD  performance  tools,  obtain- 
ing reasonable  parallel  program  performance  is  no  small  undertaking.  Our 
objective  here  is  to  explain,  extend,  and  apply  a technique  that  gives  the 
programmer  useful  performance  information  and  is  portable  across  machines 
as  well  as  architectures.  The  technique  works  equally  well  in  both  shared 
memory  and  message  passing  environments.  This  work  emphasizes  the  SPT 
techniques  for  shared  memory  programs. 

Many  existing  tools  [6,  8,  9,  10,  11,  12]  focus  on  capturing  performance 
metrics  via  monitoring.  Performance  metrics  for  parallel  programs  can  pro- 
vide an  overwhelming  amount  of  internal  detail  that  is  difficult  to  relate  to 
performance  bottlenecks.  Our  approach  identifies  sources  of  performance 
degradation  via  a sensitivity  analysis  which  links  program  bottlenecks  di- 
rectly to  the  source  code.  Synthetic-Perturbation  Tuning  (SPT)[1]  introduces 
the  notion  of  inserting  user-induced  artificial  delays  into  the  source  code  and 
capturing  the  effect  of  such  delays  by  employing  design  of  experiments  tech- 
niques. 

In  the  rest  of  this  section  we  describe  the  problems  associated  with  con- 
ventional profiling  techniques  when  applied  to  MIMD  architectures.  We  also 
report  on  existing  tools  for  tuning  parallel  program  performance.  Finally, 
we  give  an  argument  for  program  sensitivity  analysis  without  conventional 
profiles.  A step-by-step  methodology  of  SPT  is  presented  in  Section  2.  In  Sec- 
tion 3 we  extend  the  technique  to  address  specific  features  that  are  the  source 
of  poor  performance  on  the  shared  memory  programming  model.  Sources 
of  performance  degradation  include  load  imbalance,  insufficient  parallelism, 
synchronization,  critical  sections,  and  compute  time  bottlenecks.  Section  4 
illustrates  the  practicality  of  SPT  by  demonstrating  its  use  on  two  case  stud- 
ies (an  image  processing  benchmark  and  a parallel  quicksort).  The  last  section 
(Section  5)  draws  conclusions  and  describes  future  plans. 




1.1  Motivation 


Performance  statistics  have  long  been  used  to  improve  program  execution 
efficiencies  [18,  19,  7].  The  most  common  statistics  are  frequency  counts  and 
timings  for  segments  of  code.  Segments  can  be  procedures  or  smaller  entities, 
such  as  pieces  of  straight  line  code.  Simple  and  intuitive  to  use,  execution 
profiles  reveal  program  bottlenecks  that  impede  execution. 

The  advent  of  the  MIMD  parallel  system  raises  two  challenges  to  con- 
ventional profiling.  The  first  problem  is  an  exploding  state  space.  Program 
profiles  on  serial  machines  implicitly  define  a set  of  disjoint  execution  states 
whose  occupancies  sum  to  a total  response  time.  Each  execution  thread  of 
an  MIMD  program  defines  a similar  set  of  (sub)states.  Unfortunately,  the 
set  of  states  for  the  whole  MIMD  program  is  enormous.  To  see  this,  imagine 
first  a serial  program  with  a main  procedure  and  four  callable  procedures; 
there  are  five  states  at  the  procedural  level  of  profiling.  Now  consider  a par- 
allel version  of  this  program  on  a small,  eight-processor  system.  Eight  active 
threads, each with five substates, lead to a state set whose size is
$5^8 = 390{,}625$ states. Processor inactivity will further increase this
number. Choosing not to distinguish among functionally identical processors
collapses some states into what can be termed macrostates, but the
fundamental problem remains: the program state space becomes enormous as
the scalable parallel system grows larger.
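
The growth of this state set is easy to check numerically. The following is a minimal sketch in C, not part of the original method; the function name `joint_state_count` is a hypothetical helper, and it assumes every thread has the same number of substates and ignores idle states.

```c
#include <stdint.h>

/* Number of joint profile states when `threads` threads each occupy one
 * of `substates` states: substates raised to the power threads.
 * Returns 0 if the result does not fit in 64 bits. */
uint64_t joint_state_count(uint64_t substates, unsigned threads)
{
    uint64_t count = 1;
    for (unsigned t = 0; t < threads; ++t) {
        if (substates != 0 && count > UINT64_MAX / substates)
            return 0;               /* next multiply would overflow */
        count *= substates;
    }
    return count;
}
```

With five substates, eight threads give 390,625 joint states, and sixteen threads already give 152,587,890,625.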

A second  MIMD  profiling  challenge  is  the  coupling  among  profile  states 
caused  by  parallel  execution.  Conventional  profile  statistics  require  a much 
deeper  interpretation  in  MIMD.  This  problem  is  to  be  expected.  With  sepa- 
rate threads  of  execution  working  on  a joint  computation,  it  is  natural  that 
communication and constraints must exist among threads. Interdependencies
are manifest as latencies: a wait for a message, or a pause prior to writing
some shared variable. Because latencies are generated by circumstances of
the  system  and  program,  they  are  not  easily  estimated.  Latencies  can  range 
from  negligible  to  devastatingly  large. 

1.2  Related  Work 

Existing  techniques  collect  performance  statistics  in  a number  of  ways.  In 
the  taxonomy  shown  in  Table  1,  the  first  row  shows  two  common  methods  of 
defining events to be recorded. The first method is periodic sampling (I),
which is tied to a clock and is based on collecting statistics. For example,
at  regular  time  intervals,  an  interrupt  may  be  generated  and  the  program 
counter  at  that  point  looked  up  in  an  allocation  table.  This  gives  the  name 
of  the  procedure  that  was  active  at  the  clock  tick.  Periodic  sampling  is 
very  popular  for  instrumenting  systems  that  run  an  anonymous  collection 
of  programs.  No  changes  are  necessary  to  any  user  program.  By  adjusting 
the  sampling  frequency,  the  overhead  can  be  adjusted  to  some  convenient 
level.  One  big  disadvantage  is  in  testing  coverage;  if  a piece  of  code  is  not 
recorded as having been run, it may in fact not have run, or the sampling
may  have  been  unlucky.  The  system  also  has  problems  with  interpretive 
language  systems,  since  locations  within  the  interpreter  mean  little  to  a user. 


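    +-----------------------+------------------------+
    | I.  Periodic Sampling | II.  Fixed Triggering  |
    +-----------------------+------------------------+
    | a.  Traces            | b.  Histograms         |
    +-----------------------+------------------------+

Table 1: Simple Taxonomy of Performance Techniques.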

The  second  method  of  fixed  triggering  (II)  uses  identifiable  locations 
or  patterns,  which  when  reached  or  matched,  define  an  event.  For  instance, 
a special  procedure  call  upon  entry  to  a segment  of  executable  statements 
will  record  information  about  the  program  at  that  point.  Fixed  triggering  is 
bound  to  features  of  software  or  hardware.  Hence,  even  if  a few  instructions  of 
an  instrumented  procedure  execute,  this  will  be  indicated  accurately.  Testing 
coverage  for  software  is  quite  clear.  A major  drawback  is  setup.  Each  soft- 
ware or  hardware  event  of  interest  must  have  corresponding  triggers  defined 
within  the  monitoring  system.  The  technique  is  not  generally  satisfactory 
for  a constantly  changing  population. 

The  bottom  row  of  the  table  gives  common  types  of  recorded  information. 
A trace  (a)  often  comprises  a record  of  a location  in  code  or  a configuration 
of  a subsystem  plus  a time-stamp.  A constant  stream  of  traces  is  generated  as 
system  execution  proceeds,  and  from  this  data  much  important  behavior  can 
be  reconstructed.  Unfortunately,  the  stream  is  often  hard  to  manage  because 
of  its  magnitude.  Special  collection  hardware  may  be  needed  to  handle  the 
volume  of  data[13,  14]. 

Histograms (b) are an accumulative approach that demands little extra
bandwidth. The number of invocations of a procedure and the overall time
spent in a loop are of type (b), histogram or profile statistics. Because
histogram information accumulates, it demands far less storage or
bandwidth than do traces. The cost is a loss of detail, since time is not
generally recorded except as an accumulated amount. No detailed times are
kept of individual events.

The tools gprof and quartz are of type I-b. The VLSI instrumentation chip
MultiKron [14] supports either II-a or II-b. MTOOL, triggered by basic
program blocks, builds histograms and is therefore of type II-b. A type I-a is
uncommon,  since  periodic  random  sampling  yields  an  erratic  set  of  data.  It 
is  not  clear  what  detailed  I-a  traces  could  contribute  when  the  actual  infor- 
mation lies  more  in  the  aggregate  statistical  distribution  of  samplings  than 
in  any  one  sample. 

1.3  Program Sensitivities without Conventional Profiles

A practical code improvement scheme depends upon identifying the most
sensitive sections within a program, so that the worst bottlenecks can be
corrected. Fortunately, the conventional execution profile is not the only
avenue. An alternate approach treats program and system together as an
entity of essentially unfathomable complexity. Here, program segments
$\{s_i\}$ suspected of being bottlenecks are explored via systematic
perturbations of their code. This generates different versions of the
program. Overall program responses are measured and recorded for each
variant. The responses are then used to solve mathematically for
sensitivities of the segments $\{s_i\}$. The question of state in this
approach has been shifted from the executing program to simpler, source
code defined settings. This new state space is smaller, clearer and static.
Furthermore, there exists a whole body of mathematics that simplifies its
handling and interpretation. This is the statistics of design of
experiments (DEX) [2]. Experimental designs especially address
interactions. Focusing upon perturbation settings and measured responses,
the DEX analysis is designed to catch likely interactions that might impede
good performance. Each segment in $\{s_i\}$ in and of itself might not
impede parallel execution, but together, some combinations may cause
disastrous slowdowns (see example in [1]). The DEX approach can indicate
interactions easily and clearly.

One major problem in the past with applying DEX to software has been in
finding suitable ways to perturb program code. Natural program parameters
work fine, but they are not commonly available for arbitrary segments. An
alternative is to recode a segment into a new faster or slower version.
The  problem  is  the  recoding,  which  must  be  made  and  checked  very  carefully 
for  algorithmic  correctness.  The  perturbed  version  must  compute  exactly  the 
same  internal  and  external  results.  Recoding  is  slow  and  checking  is  tedious. 
Furthermore,  each  segment  must  be  treated  in  this  ad  hoc  manner.  The 
efficient  solution  is  to  make  all  perturbations  artificial.  By  doing  this,  each 
synthetic  perturbation  is  easily  introduced  or  removed,  and  yet  it  does  not 
interfere  with  the  computation  of  the  original  code.  Since  synthetic  code 
does  simulate  changes  in  coding  to  a segment,  DEX  analysis  proceeds  in  its 
normal  fashion. 
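
To make the idea of an artificial perturbation concrete, the sketch below shows one way such a delay could be written in C. It is an illustration under stated assumptions, not the authors' implementation: the busy-loop body, the macro name `SPT_PERTURB_SUM`, the example routine `sum_segment`, and the delay length of 1000 iterations are all hypothetical. The point it illustrates is that the delay consumes time without reading or writing any program data, so the perturbed and unperturbed variants compute identical results.

```c
/* Hypothetical synthetic delay: performs `iterations` loop steps without
 * touching program data. `volatile` keeps the compiler from removing the
 * loop. Returns the number of steps actually performed. */
unsigned long spt_delay(unsigned long iterations)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < iterations; ++i)
        sink = sink + 1;
    return sink;
}

/* Example code segment (assumed for illustration): sums an array.
 * Compiling with -DSPT_PERTURB_SUM yields the perturbed variant;
 * the returned sum is the same either way. */
long sum_segment(const long *values, int count)
{
    long total = 0;
    for (int i = 0; i < count; ++i) {
#ifdef SPT_PERTURB_SUM
        spt_delay(1000UL);          /* illustrative delay length */
#endif
        total += values[i];
    }
    return total;
}
```

Switching the macro on or off at compile time introduces or removes the perturbation without any edit to the segment's own statements.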


2 Description  of  Technique 

Synthetic-Perturbation  Tuning  (SPT)  is  an  empirical  approach  that  treats 
an  MIMD  program  as  a black  box  with  input  parameters  and  outputs.  The 
SPT  approach  introduces  synthetic  perturbations  (i.e.,  artificial  delays)  into 
source  code  segments  and  relies  on  (for  design  and  analysis)  a modern  branch 
of  statistical  theory  called  design  of  experiments  (DEX)  [2,  3,  5,  20,  21].  DEX 
provides  an  efficient  methodology  for  determining  the  relative  sensitivity  of 
the  MIMD  program  to  synthetic  perturbations.  SPT  focuses  the  program- 
mer’s attention  on  the  potential  problem  areas  in  the  program.  An  important 
step  in  this  methodology  is  to  identify  which  segments  of  code  are  candidates 
for  improvements.  The  identified  code  segments  are  termed  bottlenecks.  Each 
bottleneck  is  ranked  quantitatively  according  to  its  sensitivity  to  synthetic 
perturbation.  Such  a list  is  called  an  SPT  Rank.  An  SPT  rank  is  a guide 
that  can  be  used  to  improve  (tune)  the  corresponding  code  segments. 

The  SPT  premise  is  that  if  the  program  is  highly  sensitive  to  source  code 
perturbations  in  a code  segment  (i.e.,  delay  has  a clearly  detrimental  effect 
on  performance),  then  source  code  improvements  to  that  segment  will  have 
an  opposite  (positive)  effect.  This  premise  is  easy  to  justify  for  serial  code 
since  the  SPT  ranking  can  be  done  so  that  it  corresponds  to  a combination 
of  how  often  a section  of  code  is  executed  and  its  execution  time.  In  the 
next section, a justification of this premise as it applies to shared memory
programs is provided.

In  what  follows,  we  describe  the  generic  SPT  methodology  for  tuning  an 
MIMD program. Extensions of SPT for capturing specific features for the
shared  memory  programming  model  are  given  in  the  next  section. 

1.  Determine objective and define test conditions. The first step in
SPT is to determine the goal of the tuning effort. A common objective
of  SPT  is  to  make  a rank-ordered  list  of  the  source  code  segments  based 
on  the  relative  sensitivity  of  the  MIMD  program  to  synthetic  delays 
associated  with  the  code  segments.  The  segments  that  rank  high  on 
this  list  are  potential  bottlenecks  during  the  execution  of  the  program. 

To  perform  a set  of  SPT  experiments,  the  user  must  define  a set  of  test 
conditions.  Test  conditions  include  the  source  code  implementation, 
data  set,  and  machine.  SPT’s  analysis  applies  to  the  defined  test  con- 
ditions. If  these  conditions  change,  a new  set  of  SPT  experiments  and 
analysis  may  need  to  be  performed.  Based  on  our  experience,  given 
a source  code  and  a machine,  results  for  similar  data  sets  are  usually 
consistent. 

2.  Choose candidate code segments. A candidate code segment can be
any section of code. Typically it is a function declaration, function call
(usually  for  synchronization,  e.g.,  a send  protocol  or  a locking  mecha- 
nism), critical  section,  or  a loop  construct.  Selection  of  candidate  code 
segments  can  involve  a number  of  techniques.  Important  factors  that 
help  in  narrowing  the  field  of  all  possible  code  segments  include  the 
user's knowledge of the program and code inspection. A conventional
profiling  tool  can  aid  in  this  process  as  well.  Automatic  selection  is  also 
possible.  The  user  can  perform  preliminary  experiments  on  the  set  of 
all  possible  user  defined  code  segments  (e.g.,  the  user  may  choose  to 
examine  all  loop  constructs  or  all  critical  sections).  Brief  experiments 
and  analysis  quickly  screen  out  unlikely  bottlenecks.  This  preparatory 
process  reduces  the  field  to  a manageable  size  for  which  more  exhaustive 
testing  can  be  performed. 

3.  Insert Perturbations. Each candidate code segment is instrumented
with a delay option (delay or no delay). No delay leaves the code
unperturbed. Delay takes the form of a function call that performs a
specified number of instructions. The call does not alter the natural
path  of  the  program  or  the  values  of  its  variables.  It  merely  attaches 
a specified  number  of  instructions  to  that  code  segment.  The  delays 
could  be  of  different  lengths  [1].  However,  for  simplicity,  we  opted 
to  implement  constant  delays  at  the  source  code  level.  Thus  in  the 
example  described,  delay  has  two  possible  values,  zero  or  a fixed  number 
irrespective  of  the  code  segment.  An  example  of  how  a delay  might  be 
implemented is given in the following pseudo C source code block:

    while (v--) {                /* factor 12 */  /* begin original code */
    #ifdef F12                   /* begin spt code */
        spt_delay(delay_value);
    #endif                       /* end spt code */
        Code
    }                            /* end original code */


spt_delay() is a function that performs a specified number of synthetic
instructions corresponding to delay_value. The implementation of the
delay function must yield a consistent delay while not altering the
natural path of the program. The looping block while (v--) { ... } is a
designated code segment and referred to as, for example, factor 12 (F12).
The statistical term factor is used to represent a candidate code
segment. Conditional compilation creates multiple versions of the program
corresponding to the pattern of delays indicated by the experimental plan
(next step).

The  duration  of  the  artificial  delay  is  an  important  aspect.  Ideally,  the 
delay  should  be  long  enough  so  that  it  can  easily  be  distinguished  from 
noise  and  short  enough  so  as  not  to  produce  unnecessarily  long  pro- 
gram execution  times.  The  magnitude  of  the  delay  is  often  determined 
through  trial  and  error.  A discussion  of  important  aspects  for  choosing 
the  delay  magnitude  can  be  found  in  the  extended  version  of  [1].  In 
the next three steps we describe how SPT experimental design plans
are developed and used.

    Treatment |      Factors      | Response
              |  F1    F2    F3   |
    ----------+-------------------+---------
        1     |  -     -     -    |  17.05
        2     |  +     -     -    |  17.08
        3     |  -     +     -    |  23.19
        4     |  +     +     -    |  23.34
        5     |  -     -     +    |  19.62
        6     |  +     -     +    |  19.71
        7     |  -     +     +    |  25.61
        8     |  +     +     +    |  25.71

Table 2: $2^3$ Complete Factorial Design for Xprog.

4.  Design experimental plan. Once the candidate code segments are
determined, an experimental plan can be developed. There is no
theoretical limit on how many distinct factors (source code segments) can
be investigated on a given SPT iteration. A variety of schemes can be
used for designing an experimental plan [2, 5]. A small $2^3$ complete
factorial example is given to illustrate the ideas of experimental
designs. A $2^n$ plan indicates that the experiment has n factors each at
2 levels. Here we have n = 3 factors (called, for example, F1, F2, and
F3) and 2 levels (no delay (-) and delay (+)) for each factor.

Suppose we have a MIMD program (call it Xprog) with three suspected
bottleneck locations, F1, F2, and F3, that correspond to certain code
segments within Xprog. F1 represents a for loop in the function func_Y,
F2 represents a critical section in the function func_Z, and F3 represents
a while loop in the function func_Z.

With a three factor complete factorial plan, there are $2^3 = 8$ possible
delay patterns, each indicated by a row in Table 2. A plus sign (+) in
a given row denotes that the corresponding code segment is perturbed,
and a minus sign (-) indicates that the corresponding code segment
is unperturbed (i.e., it retains its original code). In DEX terminology
each delay pattern is a treatment. Table 2 lists the eight treatments
and the corresponding response, which is the total execution time of the
MIMD program. Note that the first treatment (all minuses) represents
the original unperturbed program code.

The basis of an SPT rank associated with a source code segment is a
quantitative measure called the main effect associated with that code
segment. In our $2^3$ example the ranking of the three factors is based on
their main effects (computation of the main effects is illustrated in
step 6). A main effect of a factor (code segment) is a measure of the
sensitivity of the MIMD program to the artificial delay in that code
segment. Depending on the experiment plan that is used, this measure can
be affected in unknown ways by the interactions amongst the code
segments. In this paper we propose the use of experimental plans called
resolution IV plans that ensure that the main effects are not affected
by the 2nd-order interactions amongst the code segments [2]. The total
number of test runs required by a resolution IV plan with k factors is
approximately 2k.

5.  Run experiments according to plan and record a response. Each
treatment or version of the program is run and the corresponding
response is recorded. The response can be any useful measurement;
typically the response is the total program execution time. The
treatments are usually run in a random order.

In  our  example  (Table  2),  all  eight  versions  of  the  program  are  compiled, 
run in random order and measured for execution time. The response
time  for  each  treatment  of  the  program  is  given  in  the  Response  column. 

6.  Analyze Results. The object of data analysis is to evaluate the main
effects associated with each factor. The computed values of the main
effects  are  subsequently  used  to  produce  an  SPT  ranking  of  the  factors. 
In  addition  to  the  main  effects,  a resolution  IV  plan  provides  a measure 
of  the  standard  error  (a  measure  of  uncertainty)  associated  with  the 
computed  values. 

The main effect of a factor is the difference between two average
responses, one corresponding to the treatments which have the (+)
level of the factor and the other corresponding to the treatments which
have the (-) level of the factor. For example, in the $2^3$ plan (Table 2),
the main effect of factor F3 is the average response for treatments 5,
6, 7, and 8 (i.e., [19.62 + 19.71 + 25.61 + 25.71]/4 = 22.66), minus the
average response for treatments 1, 2, 3, and 4 (i.e., [17.05 + 17.08 +
23.19 + 23.34]/4 = 20.17). Thus the main effect for F3 is 2.49. The
main effects are organized into an ordered list to form an SPT ranking
of the code segments. An example of an ordered list (from Table 2)
that can be produced by SPT is shown in Table 3. This SPT rank is
the format we use throughout the rest of the paper.

    Rank | Factor | Main Effect† | Routine  | Construct
    -----+--------+--------------+----------+-----------------
      1  |   F2   |     6.10     | func_Z() | while loop
      2  |   F3   |     2.49     | func_Z() | critical section
      3  |   F1   |     0.09     | func_Y() | for loop

    † Standard Error of Main Effects: ±0.06

Table 3: SPT Rank of Main Effects for Xprog.

The first column of the SPT rank gives the standing of the corresponding
code segment. A higher rank indicates a higher sensitivity to artificial
delays (e.g., F2 is most sensitive to the delay). Column 2 gives
the factor number which provides a reference back to the source code
location represented by the factor. The main effects column gives the
sensitivity levels of the corresponding code segments as well as an
estimate of the standard error (footnote 4). The actual numbers are not
as important as their relative magnitudes. Column 4 describes which
function the section of code resides in. The last column indicates what
type of construct the code segment is. By surveying Table 3 we can
conclude that factor F2 is the most significant. This code segment should
be given first priority in the tuning effort.

7.  Improve  bottlenecks  and  determine  performance.  An  SPT  rank  gives 
a list  of  potential  bottlenecks.  The  bottlenecks  so  identified  may  or  may 
not  be  improvable.  Investigation  begins  with  the  higher  ranked  bottle- 
necks since  they  possess  the  greatest  potential  for  improvement.  These 
bottlenecks  can  be  pursued  further  with  SPT  to  gain  more  information 
about  the  bottlenecks  or  an  attempt  can  be  made  to  improve  them. 

After improvements to the code are attempted the user must make a
determination of performance. If the desired performance is obtained
the process is complete. Otherwise the user can continue to investigate
the program by using SPT to probe the code further.

Footnote 4: The standard error of the main effect is evaluated by treating
high order interactions as errors from noise (see [2], page 327).

This  methodology  provides  the  basic  SPT  framework.  Within  this  frame- 
work, expanded  issues  relevant  to  a particular  programming  model  and  archi- 
tecture can  be  handled.  For  example,  on  the  shared  memory  programming 
model,  programming  concerns  such  as  degree  of  parallelism,  load  balancing, 
critical  sections,  and  synchronization  can  easily  be  investigated.  The  next 
section  addresses  these  issues. 
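
The main-effect calculation and ranking of step 6 can be sketched in a few lines of C. This is an illustrative sketch, not the authors' tool: the function names `spt_main_effect` and `spt_rank` and the limit `SPT_MAX_FACTORS` are assumptions, and the code handles only a complete $2^k$ factorial whose responses are stored in the treatment order of Table 2 (run r has factor f at the (+) level exactly when bit f of r is set, with F1 as bit 0).

```c
#define SPT_MAX_FACTORS 16   /* assumed upper bound on k for this sketch */

/* Main effect of factor f (0-based) in a complete 2^k factorial:
 * average response over the (+) runs minus average over the (-) runs.
 * Requires k >= 1. */
double spt_main_effect(const double *response, int k, int f)
{
    int runs = 1 << k;
    double plus = 0.0, minus = 0.0;
    for (int r = 0; r < runs; ++r) {
        if ((r >> f) & 1)
            plus += response[r];
        else
            minus += response[r];
    }
    return (plus - minus) / (runs / 2);   /* each level has runs/2 runs */
}

/* Fills order[0..k-1] with factor indices sorted by decreasing main
 * effect, i.e. the SPT rank. Requires 1 <= k <= SPT_MAX_FACTORS. */
void spt_rank(const double *response, int k, int *order)
{
    double effect[SPT_MAX_FACTORS];
    for (int f = 0; f < k; ++f) {
        effect[f] = spt_main_effect(response, k, f);
        order[f] = f;
    }
    /* Insertion sort; adequate for a handful of factors. */
    for (int i = 1; i < k; ++i) {
        int cur = order[i];
        int j = i - 1;
        while (j >= 0 && effect[order[j]] < effect[cur]) {
            order[j + 1] = order[j];
            --j;
        }
        order[j + 1] = cur;
    }
}
```

Applied to the eight responses of Table 2, the unrounded main effects are 0.0925 for F1, 6.0975 for F2, and 2.4975 for F3, giving the order F2, F3, F1 of Table 3. The value 2.49 quoted for F3 results from rounding the two averages to 22.66 and 20.17 before subtracting.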


3 SPT  Applied  to  Shared  Memory  Programs 

The  emergence  of  shared  memory  multiprocessors  in  the  past  decade  has 
given  rise  to  a substantial  effort  in  designing  and  analyzing  software  for  these 
machines.  According  to  Bell  [17],  “the  mainline,  general-purpose  computer  is 
almost  certain  to  be  the  shared  memory,  multiprocessor  after  1995.”  Hence 
it  is  important  to  develop  tuning  tools  for  shared  memory  programs.  The 
SPMD  (Single  Program,  Multiple  Data)  model  using  a single  address  space 
is  the  natural  programming  model  for  shared  memory  multiprocessors.  This 
programming  model  can  be  viewed  as  an  evolution  of  the  traditional  pro- 
gramming model  used  for  von  Neumann  architectures.  The  performance  of 
a shared  memory  program  depends  crucially  on  several  interrelated  factors 
such  as  the  amount  of  parallelism  used,  the  degree  to  which  the  work  load  is 
balanced  among  the  processors,  the  contention  over  shared  resources  (inter- 
connection network,  bus,  memory),  and  the  overhead  incurred  by  synchro- 
nization. Unless  a balance  taking  into  consideration  the  relative  importance 
of  these  factors  is  maintained,  the  actual  performance  of  shared  memory  pro- 
grams will  be  disappointing.  In  fact,  experimental  work  thus  far  bears  this 
out.  In  the  rest  of  this  section,  we  describe  our  approach  for  finding  bot- 
tlenecks related  to  each  of  these  aspects  as  they  arise  in  a shared  memory 
program  (some  strategies  apply  to  message  passing  environments  as  well). 
The  SPT  approach  described  in  the  previous  section  is  expanded  to  deter- 
mine the  sources  of  potential  bottlenecks.  In  the  next  section,  we  illustrate 
the  use  of  these  techniques  on  two  case  studies,  the  Image  Understanding 
Benchmark^  and  a parallel  version  of  the  quicksort  algorithm. 

Degree of Parallelism: A typical MIMD program contains a mix of scalar,
serial, vector, and parallel operations. A section of code with insufficient
parallelism is a bottleneck if its execution time is significant compared to
the overall execution time of the program. Such a bottleneck can be detected
only if the performance of the program is analyzed as a function of the
number of processors involved. In fact, by Amdahl's law, for a given
program, it is the execution time of the serial portions that will
ultimately determine the speed of the program as the number of processors
increases (and the input size is held constant). Our method is based on an
extension of this observation.

We  insert  artificial  delays  into  the  sections  of  code  under  investigation. 
We  then  perform  the  design  of  experiments  on  successively  scaled-up  versions 
of  the  system.  As  the  number  of  processors  increases,  the  effects  of  the 
parallel  code  will  become  less  important  while  the  effects  of  the  serial  code 
will  become  more  significant. 

Consider for example a section of code that multiplies an $n \times n$ matrix $A$
by a vector $x$ to generate the vector $y = Ax$. Partition $A$ as follows

$$A = \begin{pmatrix} A_1 \\ \vdots \\ A_p \end{pmatrix}$$

where each $A_i$ is of size $(n/p) \times n$, $n/p$ is assumed to be an integer, and
$p$ is the number of processors available. The following section of the code
corresponds to the computation performed by the ith processor

    for j = (i - 1)(n/p) + 1 to i(n/p) do
        y(j) := 0
        for k = 1 to n do
            SPT-delay
            y(j) := y(j) + A(j,k) x(k)
        end
    end

The execution time of this section of code is proportional to
$\frac{n^2}{p}(\Delta + 2t_{fp})$, where $\Delta$ is the SPT delay time, and $t_{fp}$ is the
time it takes to execute a floating-point add or multiply (assumed to be
equal for simplicity). Hence the effect of the SPT delay is a net increase
of $\frac{n^2}{p}\Delta$ in the total execution time; thus, it represents a factor whose
effect is a decreasing function of $p$. Therefore, the effect of the parallel
code becomes less important as the system is scaled up.
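As a rough illustration of the count behind this estimate, the following C
sketch runs the loop for one processor's block of rows and counts how many
times the delay hook fires, which should be n^2/p. The names
spt_delay_calls, spt_delay_hook, and matvec_block are hypothetical; the hook
only counts invocations instead of waiting, and zero-based indexing replaces
the one-based indexing of the pseudocode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instrumentation: counts how often the delay would fire.
   A real SPT delay would spin or sleep for a time Delta instead. */
static long spt_delay_calls = 0;

static void spt_delay_hook(void)
{
    spt_delay_calls++;
}

/* Rows handled by processor i (0-based) out of p: computes y = A x for
   rows i*(n/p) .. (i+1)*(n/p) - 1. A is n x n in row-major order.
   Assumes p divides n, as in the text. */
static void matvec_block(const double *A, const double *x, double *y,
                         size_t n, size_t p, size_t i)
{
    assert(p > 0 && n % p == 0 && i < p);
    size_t rows = n / p;
    for (size_t j = i * rows; j < (i + 1) * rows; j++) {
        y[j] = 0.0;
        for (size_t k = 0; k < n; k++) {
            spt_delay_hook();              /* SPT-delay */
            y[j] += A[j * n + k] * x[k];
        }
    }
}
```

Calling matvec_block once for each i from 0 to p - 1 reproduces the full
product, and each call triggers the hook n^2/p times, so doubling p halves
the delay seen by any one processor.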

Load Balancing: The speedup achieved by a parallel program is primarily
due to the development of threads of execution that can be run concurrently.
This can be done either by using functional or data decomposition present
(explicitly or implicitly) in an existing algorithm, or by developing a new
algorithm that has a higher degree of (functional or data) parallelism. With
functional decomposition, each processor is responsible for executing a
different function, and hence the distribution of the loads among the
processors is completely dependent on the computational requirements of
these functions. Similarly, data decomposition can result in some processors
having to handle much larger amounts of data than the rest of the processors.

A load balancing problem can be viewed as insufficient parallelism that,
in general, arises dynamically. The insertion of artificial delays followed by an
SPT analysis allows us to determine each section of the code that generates a
significant load imbalance. Notice that an SPT delay will cause the processor
with the heaviest load to run even slower and hence its SPT effect will be
significant. Consider for example the case when there are p processors that
have to be assigned to process (say, search for a specific item) n lists $L_j$ of
different sizes, for $0 \le j < n$. We can use data decomposition by assigning
the ith processor to process the lists $L_i, L_{i+p}, \ldots$, for $0 \le i < p$. The
following program segment illustrates such a decomposition.

    for (i = id; i < n; i += p) {
        for (j = head(i); j != NULL; j = j->next) {
            SPT-Delay
            Process Node j
        }
    }

The impact of the SPT delay $\Delta$ is proportional to
$\Delta \max_{0 \le i < p} \{ |L_i| + |L_{i+p}| + \cdots \}$. It follows that the larger the
total size of the lists to be processed by a single processor, the more
significant the SPT contribution of the corresponding factor.
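The quantity inside the maximum can be computed directly when the list
lengths are known. The C sketch below does so for the cyclic assignment
described above; the function name max_cyclic_load and the representation of
the lists as an array of lengths are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Largest total list length assigned to any one of p processors when
   processor i takes lists i, i+p, i+2p, ... (cyclic assignment).
   len[j] is the number of nodes in list j. Hypothetical helper. */
static size_t max_cyclic_load(const size_t *len, size_t n, size_t p)
{
    assert(p > 0);
    size_t worst = 0;
    for (size_t i = 0; i < p; i++) {
        size_t load = 0;
        for (size_t j = i; j < n; j += p)
            load += len[j];            /* nodes visited by processor i */
        if (load > worst)
            worst = load;
    }
    return worst;
}
```

Multiplying the returned value by the delay time gives the model's estimate
of the delay's impact. With lengths {10, 1, 1, 1, 10, 1} and four processors,
processor 0 receives both long lists, so its load of 20 dominates even though
an even split of the 24 nodes would give each processor 6.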

In the Image Understanding benchmark that we study in the next section,
we use this technique to predict the load imbalance that is caused by
the procedure to determine the connected components of an image. In this
case, the processor assigned to handle the background pixels has much more
work to do than the remaining processors. Without analyzing the procedure,
our SPT analysis was able to determine the load imbalance resulting from
this procedure and to predict its importance as the number of processors
increases.

Critical Sections and Synchronization: Processors executing a shared
memory program may waste a substantial amount of time trying to enter a
critical section ("busy wait") or trying to synchronize their activities. SPT
can be used to provide information concerning any significant overhead
incurred in a critical section or at a synchronization point. We start by
handling critical sections.

The  insertion  of  an  artificial  delay  into  a critical  section  allows  us  to 
perform  an  SPT  analysis  similar  to  the  previous  two  cases.  We  claim  that, 
for  a critical  section  that  represents  a significant  bottleneck  in  the  program, 
its  SPT  effect  will  become  more  important  as  we  scale-up  the  system.  In  fact, 
the  overall  contribution  of  the  delays  tends  to  be  cumulative  with  respect  to 
the  number  of  processors  that  are  trying  to  access  the  critical  section. 

As for synchronization, we cannot use the technique in a straightforward
way. However, we can extend it as follows. For each synchronization barrier,
we insert two types of perturbations, one immediately before the barrier
and the other immediately after the barrier. The perturbation FB1 inserted
before the barrier consists of an artificial critical section, while the
perturbation FB2 inserted after the barrier consists of an artificial critical
section followed by an artificial barrier. The justification of the
perturbation FB2 is as follows. The critical section delay in FB2 is an obvious
bottleneck since all the released threads try to execute it at once, whereas
the artificial barrier ensures that the new program is functionally identical
to the original one. We then run our experiments and compare the effects of
FB1 and FB2. If their effects are about the same, we can conclude that the
synchronization cost is marginal. The argument is that in this case FB1 is
also being pressed for execution by many threads, which indicates that the
threads arrive at the barrier all together, as in a good parallel execution.
As the difference in the two effects increases, the synchronization cost
increases. Threads that arrive one-by-one at FB1 will not find it much of a
bottleneck and hence its effect will be lower than that of FB2. It follows
that by comparing the effects of FB1 and FB2, we will be able to diagnose
whether a barrier is being used efficiently. This method is applied in the
next section to a quicksort program that contains several synchronization
points and is shown to identify properly the costly synchronizations.
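The argument can be made concrete with a deliberately simplified timing
model, sketched below in C. It assumes that threads are granted the
artificial critical section in arrival order, that the lock itself costs
nothing, and that the unperturbed barrier opens when the last thread arrives.
The function names fb1_effect and fb2_effect are hypothetical and the model
is not part of the SPT method itself.

```c
#include <assert.h>
#include <stddef.h>

/* Extra time added before a barrier opens when every thread must first
   pass through a critical section holding a delay of `delta` (the FB1
   perturbation). arrival[] must be sorted in nondecreasing order. */
static double fb1_effect(const double *arrival, size_t p, double delta)
{
    assert(p > 0 && delta >= 0.0);
    double done = arrival[0] + delta;           /* first thread leaves */
    for (size_t i = 1; i < p; i++) {
        double start = arrival[i] > done ? arrival[i] : done;
        done = start + delta;                   /* wait if busy, then delay */
    }
    return done - arrival[p - 1];               /* baseline: last arrival */
}

/* After the barrier all p threads are released together, so the FB2
   critical section serializes p delays. */
static double fb2_effect(size_t p, double delta)
{
    return (double)p * delta;
}
```

In this model, four threads arriving together with a unit delay give equal
FB1 and FB2 effects of 4, so the difference is zero. Four threads arriving
10 time units apart give an FB1 effect of 1 against an FB2 effect of 4, and
the difference of 3 reflects the time the early threads spend waiting.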

Summary:  SPT  can  be  used  to  detect  bottlenecks  due  to  lack  of  parallelism, 
load  imbalance,  and  critical  sections,  by  simply  inserting  artificial  delays  into 
appropriate  sections  of  the  code  and  conducting  a design  of  experiments  and 
an  SPT  analysis  as  described  in  Section  2.  As  for  synchronization,  we  can 
insert  two  types  of  delays,  one  immediately  before  and  the  other  immediately 
after  each  synchronization  barrier,  and  conduct  the  design  of  experiments  and 
an  SPT  analysis  as  before.  Therefore  SPT  can  be  used  to  detect  the  main 
sources  of  inefficiency  in  a shared  memory  program.  In  the  next  section,  we 
illustrate  our  techniques  on  two  detailed  case  studies. 


4 Case  Studies 

4.1  Image  Processing  Benchmark 

In this section we present a practical shared memory tuning example based
upon a large image processing benchmark. The test code is the Image
Understanding Benchmark for parallel computers developed at the University
of Massachusetts at Amherst [15]. The benchmark was described as a
"complex benchmark that would be almost impossible to tune" [15]. Using SPT,
we demonstrate how important bottlenecks were identified and subsequently
analyzed and improved.

The benchmark was designed to test common vision tasks on parallel
architectures. It consists of a model-based object recognition problem, given
two sources of sensory input, intensity and range data, and a collection of
candidate models. The intensity image is a 512 x 512 array of 8-bit pixels,
while the depth image consists of a 512 x 512 array of 32-bit floating point
numbers. The models contain rectangular surfaces, floating in space, viewed
under orthographic projection. Added to the configuration are both noise
and spurious nonmodel surfaces. The benchmark's task is to recognize an
approximately specified 2 1/2-dimensional "mobile" sculpture in a cluttered
environment. The sculpture is a collection of 2-dimensional rectangles of
various sizes, brightnesses, orientations, and depths.

The experiments are performed on both a ten processor and twenty-six
processor Sequent Symmetry. The Image Understanding Benchmark package
comes with a number of data sets and their corresponding outputs. The
example presented here uses test set number two. The benchmark consists
of more than 50 procedures and has approximately 3500 lines of C code.

Our objective for performing an SPT analysis on this example is to screen
the code for potential bottlenecks at different levels of parallelism. We
selected 31 factors (loops, function declarations, and critical sections) as
potential candidates for bottlenecks based on code inspection. An experimental
plan is selected to handle the large number of code segments that need to
be investigated. The image benchmark is instrumented with an SPT delay
for each factor. The treatments are run in a random order and the overall
execution time of the program is recorded as the response.
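For readers who want to see how a main effect is obtained from such runs,
the C sketch below uses the usual two-level definition: the mean response
with the factor's delay switched on minus the mean response with it switched
off. It assumes each run records a level of +1 or -1 per factor; the function
name main_effect and the array layout are illustrative choices, not the
authors' tooling.

```c
#include <assert.h>
#include <stddef.h>

/* Main effect of one factor in a two-level design.
   level[r] is +1 (delay on) or -1 (delay off) for run r;
   resp[r] is the measured execution time of run r. */
static double main_effect(const int *level, const double *resp, size_t runs)
{
    double sum_on = 0.0, sum_off = 0.0;
    size_t n_on = 0, n_off = 0;
    for (size_t r = 0; r < runs; r++) {
        if (level[r] > 0) { sum_on += resp[r]; n_on++; }
        else              { sum_off += resp[r]; n_off++; }
    }
    assert(n_on > 0 && n_off > 0);      /* both levels must appear */
    return sum_on / (double)n_on - sum_off / (double)n_off;
}
```

For a small two-factor plan with responses 10, 14, 11, and 15, where the
first factor alternates off/on and the second is off for the first two runs,
the first factor's main effect is 4 and the second's is 1, so the first delay
sits in the more time-critical code.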

Table 4 lists the main effects of the 31 factors of the image processing
benchmark running on 8 processors. This initial set of experiments indicates
that the three top ranked procedures (Gradient Magnitude, Median Filtering,
and Connected Components) represent major bottlenecks. Hence tuning
the corresponding code segments should be given first priority. Notice that
none of the top ranked factors involves a critical section or a synchronization
barrier. Therefore the tuning effort should concentrate on increasing the
efficiency of the serial sections within the loops (corresponding to factors
F17, F26 and F2), or better balancing the load among the processors, or
increasing the degree of parallelism. Since factor F17 was ranked
highest, we concentrated initially on the corresponding code segment.

The Gradient Magnitude procedure performs a standard 3 x 3 Sobel
operation on the depth image. The section of code within the loop corresponding
to factor F17 is quite inefficient. After removing multiplications by zero,
and reducing the total number of remaining multiplications, the execution
time of the procedure improved 300%. At this point, the relative ranking of
the procedure dropped to 8 with 8 processors (Table 9).
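The kind of rewrite described here can be sketched in C as follows. This is
not the benchmark's code: the standard Sobel kernels, the function names
sobel_naive and sobel_fast, and the row-major float image layout are all
assumptions chosen to show how dropping the zero coefficients and factoring
the rest reduces 18 multiplications per pixel to 2.

```c
#include <assert.h>
#include <stddef.h>

static const float KX[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
static const float KY[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

/* Direct form: 18 multiplications per pixel, six of them by zero.
   (r, c) must be an interior pixel of an image with row width w. */
static void sobel_naive(const float *img, int w, int r, int c,
                        float *gx, float *gy)
{
    float sx = 0.0f, sy = 0.0f;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++) {
            float v = img[(r + dr) * w + (c + dc)];
            sx += KX[dr + 1][dc + 1] * v;
            sy += KY[dr + 1][dc + 1] * v;
        }
    *gx = sx;
    *gy = sy;
}

/* Same result with zero terms removed and common factors grouped:
   two multiplications per pixel. */
static void sobel_fast(const float *img, int w, int r, int c,
                       float *gx, float *gy)
{
    const float *up  = img + (r - 1) * w + c;   /* row above  */
    const float *mid = img + r * w + c;         /* same row   */
    const float *dn  = img + (r + 1) * w + c;   /* row below  */
    *gx = (up[1] - up[-1]) + 2.0f * (mid[1] - mid[-1]) + (dn[1] - dn[-1]);
    *gy = (dn[-1] - up[-1]) + 2.0f * (dn[0] - up[0]) + (dn[1] - up[1]);
}
```

Both functions return the horizontal and vertical responses; how the
benchmark combines them into a magnitude is not stated in the text, so that
step is left out.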

Rank  Factor  Main Effect†  Routine               Construct
----  ------  ------------  --------------------  ----------------
   1      17          6.03  Gradient Magnitude    for loop
   2      26          5.46  Median Filtering      while loop
   3       2          5.26  Connected Components  for loop
   4       1          3.94  Connected Components  function
   5       4          3.84  Connected Components  while loop
   6      25          2.01  Median Filtering      for loop
   7      20          1.67  Match                 function
   8      29          1.36  Probe                 for loop
   9       6          0.64  Extract Cues          for loop
  10      21          0.53  Match                 for loop
  11      19          0.16  K-curvature           for loop
  12      18          0.10  K-curvature           for loop
  13      12          0.10  Complete Match        critical section
  14      11          0.08  Complete Match        while loop
  15      13          0.08  Complete Match        while loop
  16       8          0.05  Complete Match        function
  17      10          0.05  Complete Match        critical section
  18      24          0.05  Median Filtering      while loop
  19       5          0.05  Connected Components  while loop
  20      15          0.04  Extract Cues          critical section
  21      16          0.04  Complete Match        critical section
  22       3          0.04  Connected Components  while loop
  23      14          0.03  Complete Match        critical section
  24      28          0.03  Probe                 function
  25       7          0.02  Complete Match        while loop
  26      27          0.02  Probe                 for loop
  27      22          0.02  Median Filtering      for loop
  28      23          0.01  Median Filtering      for loop
  29       9          0.01  Complete Match        function
  30      31          0.00  Trace Boundary        while loop
  31      30          0.00  Graham Scan           while loop

† Standard Error of Main Effects: ±0.04

Table 4: SPT Rank for Image Benchmark, 8 Processors.

Our next task was to consider the Median Filtering procedure. While
we were attempting to tune this procedure, we discovered that the procedure
generated erroneous results. At this time we switched our efforts to tuning the
third procedure, Connected Components. This procedure assigns a unique
label to each contiguous collection of pixels having the same intensity level
value. To gain a better understanding, we ran our experiments using 2, 4, 8,
and 24 processors. Tables 5, 6, 7, and 8 show the resulting rankings of the
major factors (on the original code) as a function of the number of processors.

It is immediately clear that there is a serious load balancing problem; the
three factors (F1, F2, F4) corresponding to Connected Components have
gradually moved to the very top of the table as the number of processors
increased. A close examination of the procedure confirms our suspicion.
One processor is assigned to handle the background pixels and hence ends
up doing most of the work. A completely different scheduling policy or a
completely new algorithm is required before a significant improvement can
be made. Nevertheless, even slight modifications allowed us to improve the
performance of this procedure.
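A quick arithmetic check on the figures in Tables 5 to 8 shows the pattern
behind this observation. Under the simple model of Section 3, a delay in
fully parallel code contributes an effect proportional to 1/p, so the product
of the main effect and p should stay roughly constant, whereas a delay in
code that one processor executes largely alone should show a roughly constant
effect. The grouping below is only a reading of the published numbers under
that assumption, not an additional measurement.

```latex
\begin{align*}
\text{F17 (Gradient Magnitude), effect} \times p:\quad
  & 24.04 \times 2 = 48.08,\quad 12.06 \times 4 = 48.24,\quad
    6.03 \times 8 = 48.24,\quad 2.14 \times 24 = 51.36 \\
\text{F2 (Connected Components), effect at } p = 2, 4, 8, 24:\quad
  & 5.26,\quad 5.27,\quad 5.26,\quad 5.43
\end{align*}
```

The Gradient Magnitude loop behaves like well-parallelized code whose share
shrinks with p, while the Connected Components loop keeps the same absolute
effect and therefore climbs in rank.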

We now show the results of the SPT analysis when performed on our
improved version. We have modified the Gradient procedure as indicated
earlier and have made some simple modifications to the Connected
Components procedure. Tables 9 and 10 show a summary of the SPT analysis
when performed on our improved version. Notice that the Gradient
procedure (rank=8 with 8 processors, and rank=17 on 24 processors) is no longer
a significant bottleneck and that the Median, Connected Components, and
Probe contribute much more significantly to the overall running time when
the number of processors increases beyond eight. Using eight processors, our
version runs 18.2% faster than the original version.

Rank  Factor  Main Effect†  Routine               Construct
----  ------  ------------  --------------------  ----------------
   1      17         24.04  Gradient Magnitude    for loop
   2      26         22.18  Median Filtering      while loop
   3      25          8.27  Median Filtering      for loop
   4       2          5.26  Connected Components  for loop
   5       1          4.50  Connected Components  function
   6       4          4.34  Connected Components  while loop

† Standard Error of Main Effects: ±0.04

Table 5: SPT Rank for Image Benchmark, 2 Processors.


Rank  Factor  Main Effect†  Routine               Construct
----  ------  ------------  --------------------  ----------------
   1      17         12.06  Gradient Magnitude    for loop
   2      26         11.09  Median Filtering      while loop
   3       2          5.27  Connected Components  for loop
   4       4          4.35  Connected Components  while loop
   5       1          4.33  Connected Components  function
   6      25          4.09  Median Filtering      for loop

† Standard Error of Main Effects: ±0.04

Table 6: SPT Rank for Image Benchmark, 4 Processors.


Rank  Factor  Main Effect†  Routine               Construct
----  ------  ------------  --------------------  ----------------
   1      17          6.03  Gradient Magnitude    for loop
   2      26          5.46  Median Filtering      while loop
   3       2          5.26  Connected Components  for loop
   4       1          3.94  Connected Components  function
   5       4          3.84  Connected Components  while loop
   6      25          2.01  Median Filtering      for loop

† Standard Error of Main Effects: ±0.04

Table 7: SPT Rank for Image Benchmark, 8 Processors.


Rank  Factor  Main Effect†  Routine               Construct
----  ------  ------------  --------------------  ----------------
   1       2          5.43  Connected Components  for loop
   2       1          3.95  Connected Components  function
   3       4          3.93  Connected Components  while loop
   4      17          2.14  Gradient Magnitude    for loop
   5      26          2.03  Median Filtering      while loop
   6      20          1.55  Match                 function

† Standard Error of Main Effects: ±0.12

Table 8: SPT Rank for Image Benchmark, 24 Processors.


Rank  Factor  Main Effect†  Routine               Construct
----  ------  ------------  --------------------  ----------------
   1      26          5.57  Median Filtering      while loop
   2       2          5.30  Connected Components  for loop
   3       1          4.06  Connected Components  function
   4       4          3.93  Connected Components  while loop
   5      25          2.04  Median Filtering      for loop
   6      20          1.64  Match                 function
   7      29          1.23  Probe                 for loop
   8      17          0.59  Gradient Magnitude    for loop
   9       6          0.59  Extract Cues          for loop

† Standard Error of Main Effects: ±0.04

Table 9: SPT Rank for Improved Image Benchmark, 8 Processors.


Rank  Factor  Main Effect†  Routine               Construct
----  ------  ------------  --------------------  ----------------
   2       2          5.85  Connected Components  for loop
   4       4          4.66  Connected Components  while loop
   3       1          3.80  Connected Components  function
   1      26          2.31  Median Filtering      while loop
   7      29          2.22  Probe                 for loop
   9       6          1.82  Extract Cues          for loop
  17      17          0.60  Gradient Magnitude    for loop

† Standard Error of Main Effects: ±0.76

Table 10: SPT Rank for Improved Image Benchmark, 24 Processors.


4.2 Parallel Quicksort

The image processing benchmark provided insights on how SPT can be used
to handle large applications. It successfully detected code inefficiencies and a
load imbalance. However, synchronization and critical sections did not play
a significant role. In this section, we discuss a parallel version of the quicksort
algorithm and illustrate how SPT can be used to address bottlenecks due to
synchronization and critical sections.

The test code is a parallel implementation of Hoare's quicksort algorithm [16].
Quicksort is a scheme that is based on partitioning a given list into two
sublists relative to a selected member of the list, called the pivot. Elements of
the list are rearranged such that all elements smaller than the pivot are to the
left of the pivot and all elements greater than the pivot are to the right of the
pivot. There are several ways of choosing the pivot to induce approximately
equal partitions. We refer to such a partitioning step as a pass. Hence after a
pass, the pivot value is positioned in its sorted order. This procedure is then
applied recursively to each sublist. Once a sublist becomes small enough, it
can be sorted by using a simple sorting routine, say selection sort or bubble
sort.

A simple way to parallelize the quicksort procedure is to allocate
newly-created sublists to available processors (see [4] for a more involved
parallelization of quicksort). A sublist assigned to a processor is then
partitioned into two sublists by that processor. The allocation of sublists to
processors is controlled by a shared stack. An idle processor asks for a
sublist from the shared stack. To ensure that no two processors take
possession of the same sublist, the stack access is controlled by a critical
section.

The following is a skeleton of the program code for a simple
implementation of quicksort.

    Initializations;
    Put list on stack;

    barrier();                      /* barrier #1 */
    while (stack is not empty) {
        barrier();                  /* barrier #2 */
        lock(stack_lock);
        if (stack is not empty)
            pop();
        unlock(stack_lock);
        Select a pivot and partition current list into sublists L1 and L2;
        if (|L1| > |L2|) {
            lock(stack_lock);
            push(L2);
            push(L1);
            unlock(stack_lock);
        } else {
            lock(stack_lock);
            push(L1);
            push(L2);
            unlock(stack_lock);
        }
        barrier();                  /* barrier #3 */
    }

Paired Factor     Main Effect     Difference†
----------------  --------------  -----------
barrier pair 1    FB1:   0.16     0.06
                  FB2:   0.22
barrier pair 2    FB1:  14.22     1.00
                  FB2:  15.22
barrier pair 3    FB1:   6.78     8.56
                  FB2:  15.34

† Standard Error of the Difference: ±0.21

Table 11: Paired Effects for Quicksort's Barriers.

Our tuning effort of quicksort begins by investigating the cost of
synchronization. There are three synchronization points, denoted as barrier().
The first barrier ensures that all initializations are complete before the
processes begin executing the while loop. The two barriers within the main loop
synchronize the processes before and after each pass. This implementation
makes it easy to determine when the sort is completed. Our SPT objective
is to find out if processes are arriving at widely dispersed times, and hence
causing many processors to idle for a significantly long period of time. Our
investigation follows the treatment method presented in Section 3. For each
synchronization barrier, two types of perturbations are inserted, one
immediately before (FB1) and the other immediately after (FB2). The method is
illustrated by the following code segment:

    lock(spt_lock_1);          /* FB1 treatment: artificial critical section */
    spt_delay(spt_delay);
    unlock(spt_lock_1);
    barrier();                 /* original barrier */
    lock(spt_lock_1);          /* FB2 treatment: artificial critical section */
    spt_delay(spt_delay);
    unlock(spt_lock_1);
    barrier();                 /* FB2 treatment: artificial barrier */


The three synchronization barriers are instrumented as shown above.
This implementation demands six factors, two for each barrier tested. The
experiments proceed as before; an experimental plan is created and tested.
The result is an effect measure for all six factors. The interpretation of
the results differs slightly in that we now want to compare the effects of the
factors before and after each barrier. Table 11 shows the results. The
left-most column of Table 11 identifies the barrier. The second column gives
the calculated main effect for each factor. The individual main effects are
meaningless in isolation and must be paired up and compared to obtain the
proper information. The last column, which contains the difference of each
of the paired factors, gives an indication of the cost associated with each
synchronization. Remember that the treatment FB2 shows the effect of an ideal
barrier application, and is very sensitive to delay. If the paired effects are
about the same, we conclude that the synchronization cost is marginal. As
the difference in the two effects increases, the synchronization cost increases.
(Recall that FB1 has less effect on straggling threads.) It follows that by
comparing the effects of each pair of the delays introduced for each
synchronization barrier, we will be able to determine those incurring large
overheads. It should be noted that this type of experiment should be performed
separately from a screening experiment. The effects have no relationship to the
screening for important factors because the treatments are not comparable
in any easy fashion.
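One simple way to read the last column of Table 11 is to express each
difference in units of its standard error. The C sketch below does this;
scaling by the standard error is an assumption made here for illustration and
is not a rule stated by the authors, and the function name
paired_difference_in_se is hypothetical.

```c
#include <assert.h>

/* Difference between the FB2 and FB1 main effects of one barrier,
   divided by the standard error of the difference. Larger values
   suggest that threads reach the barrier at more dispersed times. */
static double paired_difference_in_se(double fb1, double fb2, double se)
{
    assert(se > 0.0);
    return (fb2 - fb1) / se;
}
```

With the values of Table 11 and a standard error of 0.21, the first pair
gives a ratio well below 1, while the third pair gives a ratio above 40, which
matches the ordering of synchronization costs discussed in the text.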

In spite of its simplicity, this example illustrates the effectiveness and the
generality of the SPT approach. The difference shown for the first
synchronization barrier indicates that almost all processors arrive there at the
same time. This is clearly the case since only one processor is responsible for
the initialization phase and the rest crowd around the barrier. Since this
barrier is used only once, the effects also show that it is not very important
to performance. The second synchronization barrier is not needed since the
processors are already synchronized at the beginning of each pass. The test
confirms what algorithm inspection tells us. The third row of the table
indicates that the third synchronization barrier is costly compared to the other
two synchronization barriers. This is because processors are working on
different-length sublists (or no sublist at all) and hence arrive at the third
synchronization point at widely different times. The barrier deserves some
attention.

To alleviate the problem of synchronization at the end of the while loop,
we rewrite the code following the skeleton shown next. The resulting
improvement in performance is substantial (78%).




    Initializations;
    Put list on stack;
    barrier();

    for (;;) {
        lock(stack_lock);               /* CS1 */
        if (stack is not empty) {
            pop();
        }
        unlock(stack_lock);
        if (!qsort_done) {
            Select a pivot and partition current list into sublists L1 and L2;
            if (|L1| > |L2|) {
                lock(stack_lock);       /* CS2 */
                push(L2);
                push(L1);
                unlock(stack_lock);
            } else {
                lock(stack_lock);       /* CS3 */
                push(L1);
                push(L2);
                unlock(stack_lock);
            }
        }
    }


In the next experiment, SPT's objective is to obtain the relative
importance (detrimental effect) of the three new critical sections (CS1, CS2,
CS3). A delay is inserted in each critical section. An experimental plan is
developed and run. Table 12 shows the SPT performance information for
each critical section. By looking at the program, it is not clear which
critical section presents the main bottleneck among the three critical sections.
Our SPT analysis shows that the first critical section dominates.

Rank  Factor  Main Effect†  Routine           Construct
----  ------  ------------  ----------------  ------------------
   1       1         13.82  main()            critical section 1
   2       3          2.86  main()            critical section 3
   3       2          2.62  main()            critical section 2

† Standard Error of Main Effects: ±2.15

Table 12: SPT Rank for Quicksort's Critical Sections.

At this point, we remove the other two factors from further consideration,
and perform a complete SPT analysis that includes the factor (labelled F1)
of the critical section CS1. Six additional code segments are selected to be
tested along with this critical section. These are the procedures:
partition_list(), bubble_sort(), swap(), push(), pop(), and select_pivot().
Since the delays for the critical section and the regular code segments are
equivalent, they can be compared. Table 13 shows the results obtained.

Rank  Factor  Main Effect†  Routine           Construct
----  ------  ------------  ----------------  ------------------
   1      F3         29.01  partition_list()  while loop
   2      F7          8.94  swap()            function
   3      F4          1.26  push()            function
   4      F6          0.69  bubble_sort()     while loop
   5      F1          0.14  main()            critical section 1
   6      F2          0.11  select_pivot()    function
   7      F5          0.09  pop()             function

† Standard Error of Main Effects: ±0.39

Table 13: SPT Rank for Quicksort.

Clearly factors F3 and F7 dominate the overall performance. Based upon
this data, we examine the procedure partition_list() which calls the swap()
procedure. Removing the calls to swap() and inserting its code into
partition_list() resulted in an additional 23% improvement of the execution
time of quicksort. This improvement has been reported earlier in [1] via SPT.
As shown in the same paper, using the UNIX profiling tool gprof would have
provided little information for improving the parallel quicksort routine.
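The inlining step can be illustrated with a single partitioning pass in C in
which the exchange is written out in place of a call to a separate swap()
routine. The sketch below is not the tested program: the function name
partition_inline, the int element type, and the choice of the last element
as pivot (the Lomuto scheme) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One partitioning pass over a[lo..hi] (inclusive) with the exchange
   written inline. Elements smaller than the pivot end up to its left,
   the others to its right. Returns the final index of the pivot. */
static size_t partition_inline(int *a, size_t lo, size_t hi)
{
    assert(lo <= hi);
    int pivot = a[hi];                 /* assumed pivot choice */
    size_t store = lo;
    for (size_t k = lo; k < hi; k++) {
        if (a[k] < pivot) {
            int tmp = a[k];            /* inlined swap(a[k], a[store]) */
            a[k] = a[store];
            a[store] = tmp;
            store++;
        }
    }
    int tmp = a[hi];                   /* move pivot to its sorted position */
    a[hi] = a[store];
    a[store] = tmp;
    return store;
}
```

Whether inlining pays off on a given machine depends on the compiler and the
call overhead; on the system used in the text it removed enough work from the
inner loop to be measurable.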
5 Conclusion


We  have  described  the  tuning  methodology  of  SPT,  Synthetic-Perturbation 
Tuning,  that  is  based  on  a branch  of  statistics  called  design  of  experiments. 
The  main  purpose  of  this  methodology  is  to  identify  performance  bottlenecks 
present  in  MIMD  programs.  SPT  should  provide  the  basis  of  a very  powerful 
tuning  tool  that  is  portable  across  machines  and  architectures.  We  also 
considered  in  some  detail  the  sources  of  poor  performance  on  the  shared 
memory  model  and  showed  how  these  issues  can  be  adequately  captured  using 
SPT.  Two  detailed  case  studies  were  then  discussed  and  their  bottlenecks 
analyzed  using  our  methodology.  Significant  improvements  were  made  based 
on  the  results  of  the  SPT  analysis. 

The work presented here should be viewed as a contribution towards
developing a comprehensive methodology for tuning MIMD programs based
on the techniques of the design of experiments. We are currently refining
and extending our methodology in several directions. In particular, we are
analyzing approaches to measure the performance of memory hierarchy in
a shared memory environment, and the communication overhead present in
a message passing environment. Additional large case studies are currently
being examined using SPT. Our future plans include the development of
automated tools for performing the SPT analysis and reporting the appropriate
information to the user.

A minor disadvantage of our methodology is the amount of
experimentation necessary to perform the analysis. However, we believe that
tuning MIMD programs is a highly nontrivial task requiring the capture of many
parameters and their interactions. Simpler schemes are likely to fail in one
aspect or another. The mathematical basis of our method provides a solid
foundation upon which we can build general tuning techniques that are
applicable across machines and architectures.